GLM Part3

Author

Murray Logan

Published

30/07/2023

1 Preparations

Load the necessary libraries

library(glmmTMB)       #for model fitting
library(car)           #for regression diagnostics
library(broom)         #for tidy output
library(ggfortify)     #for model diagnostics
library(DHARMa)        #for residual diagnostics
library(see)           #for plotting residuals
library(knitr)         #for kable
library(effects)       #for partial effects plots
library(ggeffects)     #for partial effects plots
library(emmeans)       #for estimating marginal means
library(modelr)        #for auxiliary modelling functions
library(tidyverse)     #for data wrangling
library(lindia)        #for diagnostics of lm and glm
library(performance)   #for residuals diagnostics
library(sjPlot)        #for outputs
library(report)        #for reporting methods/results
library(easystats)     #framework for stats, modelling and visualisation
library(MASS)          #for negative binomials
library(MuMIn)         #for AICc

2 Scenario

Here is a modified example from Peake and Quinn (1993), who investigated the relationship between the number of invertebrate individuals living amongst clumps of mussels on a rocky intertidal shore and the area of those mussel clumps.

Figure 1: Mussels
Table 1: Format of peakquinn.csv data files
AREA INDIV
516.00 18
469.06 60
462.25 57
938.60 100
1357.15 48
... ...
Table 2: Description of the variables in the peake data file
AREA Area of mussel clump (mm²) - Predictor variable
INDIV Number of individuals found within clump - Response variable

The aim of the analysis is to investigate the relationship between mussel clump area and the number of non-mussel invertebrate individuals supported in the mussel clump.

3 Read in the data

peake <- read_csv("../public/data/peakquinn.csv", trim_ws = TRUE)
Rows: 25 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (2): AREA, INDIV

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
peake |> glimpse()
Rows: 25
Columns: 2
$ AREA  <dbl> 516.00, 469.06, 462.25, 938.60, 1357.15, 1773.66, 1686.01, 1786.…
$ INDIV <dbl> 18, 60, 57, 100, 48, 118, 148, 214, 225, 283, 380, 278, 338, 274…
## Explore the first 6 rows of the data
peake |> head()
# A tibble: 6 × 2
   AREA INDIV
  <dbl> <dbl>
1  516     18
2  469.    60
3  462.    57
4  939.   100
5 1357.    48
6 1774.   118
peake |> str()
spc_tbl_ [25 × 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ AREA : num [1:25] 516 469 462 939 1357 ...
 $ INDIV: num [1:25] 18 60 57 100 48 118 148 214 225 283 ...
 - attr(*, "spec")=
  .. cols(
  ..   AREA = col_double(),
  ..   INDIV = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 
peake |> datawizard::data_codebook()
peake (25 rows and 2 variables, 2 shown)

ID | Name  | Type    | Missings |          Values |  N
---+-------+---------+----------+-----------------+---
1  | AREA  | numeric | 0 (0.0%) | [462.25, 27144] | 25
---+-------+---------+----------+-----------------+---
2  | INDIV | numeric | 0 (0.0%) |      [18, 1402] | 25
------------------------------------------------------

4 Exploratory data analysis

Model formula: \[ y_i \sim{} \mathcal{Pois}(\lambda_i)\\ ln(\lambda_i) = \beta_0 + \beta_1 ln(x_i) \]

where the number of individuals in the \(i^{th}\) observation is assumed to be drawn from a Poisson distribution with a mean (\(\lambda_i\)). The natural log of these expected values is modelled against a linear predictor that includes an intercept (\(\beta_0\)) and a slope (\(\beta_1\)) for the natural-log-transformed area.

ggplot(peake, aes(y=INDIV, x=AREA)) +
  geom_point()+
  geom_smooth()

Conclusions:

  • there is some evidence of non-homogeneity of variance (observations are more tightly clustered around the trendline for small values of AREA and the spread increases throughout the trend).
  • there is some evidence of non-linearity as evidenced by the substantial change in trajectory of the loess smoother.
  • there is also evidence of non-normality in both the response and the predictor (points are concentrated at low values of both y and x). Deviations from normality can make it difficult to assess the other assumptions; indeed, apparent non-homogeneity and non-linearity are often caused by non-normality.
ggplot(peake, aes(y=INDIV)) + geom_boxplot()

ggplot(peake, aes(y=AREA)) + geom_boxplot()

Conclusions:

  • indeed, both the response (INDIV) and predictor (AREA) appear to be skewed. It is not surprising that the response is skewed as it represents counts of the number of individuals. We might expect these to follow a Poisson distribution. Poisson distributions are bounded by 0 at the lower end, and hence as the expected value (= mean and location) of a Poisson distribution approaches 0, the distribution becomes more and more asymmetrical. At the same time, the distribution would be expected to become narrower (have a smaller variance), since in a Poisson distribution the mean and variance are equal.
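The mean-variance equality of the Poisson distribution mentioned above is easy to illustrate by simulation (a quick sketch only; the seed and rate are arbitrary choices, not part of the analysis):

```r
## In a Poisson distribution the variance equals the mean, so
## simulated counts with lambda = 4 should have a sample mean and
## sample variance that are both close to 4.
set.seed(123)
y <- rpois(100000, lambda = 4)
mean(y)
var(y)
```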

Along with modelling these data against a Poisson distribution (with a natural log link), we should probably attempt to normalise the predictor variable (AREA). Whilst linear models don’t assume that the predictor variable follows a normal distribution (it is typically assumed to be a uniform distribution), it is assumed to be symmetrical. In this case, a natural log transformation might help achieve this. At the same time, it might also help homogeneity of variance and linearity.

We can mimic the action of using a log link and log-transformed predictor by presenting the scatterplot on log-transformed axes.

ggplot(peake, aes(y=INDIV, x=AREA)) +
  geom_point()+
  geom_smooth() +
  scale_y_log10() +
  scale_x_log10()

Conclusions:

  • there are no obvious violations of the assumptions now.

5 Fit the model

\[ y_i \sim{} \mathcal{N}(\mu_i, \sigma^2)\\ \mu_i = \beta_0 + \beta_1 x_i \]

peake.lm <- lm(INDIV~AREA, data=peake) 

Model validation

peake.lm |> autoplot(which=1:6)

Conclusions:

  • there is clear evidence of a wedge (funnel) shape in the residuals indicative of non-homogeneity of variance
  • the tails of the Q-Q plot deviate substantially from the ideal line suggesting that the observed data are not normally distributed.
  • there is an observation with a high Cook’s distance.
peake.lm |> influence.measures()
Influence measures of
     lm(formula = INDIV ~ AREA, data = peake) :

    dfb.1_ dfb.AREA    dffit cov.r   cook.d    hat inf
1  -0.1999  0.13380 -0.20006 1.125 2.04e-02 0.0724    
2  -0.1472  0.09887 -0.14731 1.150 1.12e-02 0.0728    
3  -0.1507  0.10121 -0.15073 1.148 1.17e-02 0.0728    
4  -0.1153  0.07466 -0.11549 1.155 6.92e-03 0.0687    
5  -0.1881  0.11765 -0.18896 1.117 1.82e-02 0.0653    
6  -0.1214  0.07306 -0.12237 1.142 7.75e-03 0.0622    
7  -0.0858  0.05206 -0.08639 1.155 3.88e-03 0.0628    
8  -0.0176  0.01059 -0.01776 1.165 1.65e-04 0.0621    
9  -0.0509  0.02633 -0.05236 1.150 1.43e-03 0.0535    
10 -0.0237  0.01069 -0.02504 1.148 3.28e-04 0.0489    
11  0.0475 -0.01961  0.05096 1.141 1.35e-03 0.0470    
12 -0.0418  0.01717 -0.04492 1.142 1.05e-03 0.0468    
13 -0.0061  0.00221 -0.00672 1.144 2.36e-05 0.0448    
14 -0.0453  0.01859 -0.04862 1.142 1.23e-03 0.0468    
15  0.1641 -0.05100  0.18584 1.067 1.74e-02 0.0433    
16  0.0472 -0.00254  0.06317 1.129 2.08e-03 0.0401    
17  0.1764 -0.01859  0.22781 1.021 2.57e-02 0.0403    
18  0.0542  0.01493  0.09093 1.120 4.28e-03 0.0411    
19  0.2173  0.12011  0.43430 0.808 8.29e-02 0.0433    
20 -0.0701 -0.02168 -0.12016 1.106 7.43e-03 0.0413    
21  0.1849  0.63635  1.07759 0.357 3.36e-01 0.0614   *
22  0.0752 -0.33126 -0.39474 1.157 7.79e-02 0.1352    
23 -0.0789  0.22611  0.25071 1.362 3.25e-02 0.2144   *
24  0.0853 -0.21744 -0.23573 1.473 2.88e-02 0.2681   *
25  0.4958 -1.32133 -1.44477 0.865 8.44e-01 0.2445   *
peake.lm |> performance::check_model()

Conclusions:

  • there is clear evidence of a wedge (funnel) shape in the residuals indicative of non-homogeneity of variance
  • the tails of the Q-Q plot deviate substantially from the ideal line suggesting that the observed data are not normally distributed.
  • there is an observation with a high Cook’s distance.
peake.resid <- peake.lm |> simulateResiduals(plot=TRUE)

peake.resid |> testResiduals()

$uniformity

    Asymptotic one-sample Kolmogorov-Smirnov test

data:  simulationOutput$scaledResiduals
D = 0.18, p-value = 0.3927
alternative hypothesis: two-sided


$dispersion

    DHARMa nonparametric dispersion test via sd of residuals fitted vs.
    simulated

data:  simulationOutput
dispersion = 0.96392, p-value = 0.952
alternative hypothesis: two.sided


$outliers

    DHARMa outlier test based on exact binomial test with approximate
    expectations

data:  simulationOutput
outliers at both margin(s) = 1, observations = 25, p-value = 0.1813
alternative hypothesis: true probability of success is not equal to 0.007968127
95 percent confidence interval:
 0.0010122 0.2035169
sample estimates:
frequency of outliers (expected: 0.00796812749003984 ) 
                                                  0.04 

Conclusions:

  • there is clear evidence of a wedge (funnel) shape in the residuals indicative of non-homogeneity of variance

Please go to the Poisson (GLM) tab.

\[ y_i \sim{} \mathcal{Pois}(\lambda_i)\\ ln(\lambda_i) = \beta_0 + \beta_1 ln(x_i) \]

peake.glm <- glm(INDIV ~ log(AREA), data=peake, family=poisson(link='log'))

Model validation

peake.glm |> autoplot(which=1:6)

Conclusions:

  • there is still clear evidence of a wedge (funnel) shape in the residuals indicative of non-homogeneity of variance
  • the tails of the Q-Q plot deviate substantially from the ideal line suggesting that the observed data are not normally distributed.
  • there is an observation with a high Cook’s distance.
peake.glm |> influence.measures()
Influence measures of
     glm(formula = INDIV ~ log(AREA), family = poisson(link = "log"),      data = peake) :

     dfb.1_ dfb.l.AR   dffit cov.r  cook.d    hat inf
1  -0.27480  0.26720 -0.2810 1.073  1.8302 0.0710   *
2  -0.04611  0.04488 -0.0471 1.173  0.0736 0.0705    
3  -0.05602  0.05453 -0.0572 1.171  0.1071 0.0704    
4  -0.04819  0.04647 -0.0502 1.174  0.0845 0.0720    
5  -0.31731  0.30365 -0.3375 1.028  2.7532 0.0698   *
6  -0.14873  0.14126 -0.1620 1.133  0.7980 0.0668   *
7  -0.05948  0.05659 -0.0645 1.166  0.1386 0.0675    
8   0.07488 -0.07110  0.0816 1.161  0.2477 0.0667    
9  -0.06040  0.05587 -0.0727 1.151  0.1760 0.0575    
10 -0.03779  0.03419 -0.0500 1.149  0.0848 0.0527    
11  0.04660 -0.04160  0.0653 1.143  0.1555 0.0508    
12 -0.06833  0.06095 -0.0961 1.133  0.3023 0.0507    
13 -0.02679  0.02345 -0.0408 1.146  0.0569 0.0489    
14 -0.07303  0.06515 -0.1027 1.131  0.3433 0.0507    
15  0.14202 -0.12170  0.2359 1.040  2.1279 0.0477   *
16  0.01114 -0.00816  0.0302 1.145  0.0327 0.0467    
17  0.10458 -0.07980  0.2557 1.018  2.4750 0.0465   *
18  0.00878 -0.00353  0.0508 1.145  0.0930 0.0501    
19  0.03006  0.01642  0.4473 0.857  7.2424 0.0536   *
20 -0.04114  0.01412 -0.2614 1.028  1.9833 0.0505   *
21 -0.21773  0.31062  0.9321 0.530 25.9624 0.0742   *
22  0.27322 -0.31834 -0.5255 1.087  8.0430 0.1366   *
23 -0.06181  0.06967  0.1003 1.347  0.3584 0.1916   *
24  0.16956 -0.18902 -0.2594 1.382  2.2615 0.2255   *
25  0.84423 -0.94513 -1.3210 0.824 39.5373 0.2109   *
peake.glm |> performance::check_model()

peake.glm |> performance::check_overdispersion()
# Overdispersion test

       dispersion ratio =   70.410
  Pearson's Chi-Squared = 1619.441
                p-value =  < 0.001
peake.glm |> performance::check_zeroinflation()
Model has no observed zeros in the response variable.
NULL
## Note, the following cannot be piped for the plot to work!
#performance::check_normality(peake.glm) |> plot()
peake.glm |> performance::check_outliers()
11 outliers detected: cases 1, 5, 6, 15, 17, 19, 20, 21, 22, 24, 25.
- Based on the following method and threshold: cook (0.714).
- For variable: (Whole model).

Conclusions:

  • there is clear evidence of a wedge (funnel) shape in the residuals indicative of non-homogeneity of variance
  • the tails of the Q-Q plot deviate substantially from the ideal line suggesting that the observed data are not normally distributed.
  • there are observations with a high Cook’s distance.
peake.resid <- peake.glm |> simulateResiduals(plot=TRUE)

peake.resid |> testResiduals()

$uniformity

    Asymptotic one-sample Kolmogorov-Smirnov test

data:  simulationOutput$scaledResiduals
D = 0.54099, p-value = 8.826e-07
alternative hypothesis: two-sided


$dispersion

    DHARMa nonparametric dispersion test via sd of residuals fitted vs.
    simulated

data:  simulationOutput
dispersion = 105.43, p-value < 2.2e-16
alternative hypothesis: two.sided


$outliers

    DHARMa bootstrapped outlier test

data:  simulationOutput
outliers at both margin(s) = 13, observations = 25, p-value < 2.2e-16
alternative hypothesis: two.sided
 percent confidence interval:
 0.00 0.04
sample estimates:
outlier frequency (expected: 0.0112 ) 
                                 0.52 

Conclusions:

  • there is clear evidence of issues with the residual plot (including outliers)
  • the Q-Q plot does not closely follow the ideal line
  • there is evidence of violations of distributional conformity (KS test), Dispersion and Outliers
##Check the model for lack of fit via:
##Pearson chisq
peake.ss <- sum(resid(peake.glm, type = "pearson")^2)
1 - pchisq(peake.ss, peake.glm$df.resid)
[1] 0
##Evidence of a lack of fit

#Deviance
1-pchisq(peake.glm$deviance, peake.glm$df.resid)
[1] 0
#Evidence of a lack of fit
peake.ss/peake.glm$df.resid
[1] 70.41047
peake.glm$deviance/peake.glm$df.resid
[1] 67.29376

The model appears to be over-dispersed. That is, the Poisson distribution would expect (assume) that the variance is equal to the mean. The diagnostics suggest that there is more variability in the response than the Poisson model would expect.

This could be due to:

  • the model being too simple. Linear models are intentionally low-dimensional representations of the system. They do not, and cannot, represent all the complexity in the underlying system. Instead they are intended to provide very targeted insights into a small number of potential influences. Nevertheless, the model might still be too simplistic. Perhaps if we were able to add an additional covariate, we might be able to explain the additional variation. For example, although it might be reasonable to expect that the number of individuals in a mussel clump should be driven by a Poisson process, the mussel clump area alone might not be sufficient to capture the variance. Perhaps if we were able to include additional information, such as the position of the clump along the shore (tidal influence) or its orientation on the rock, we might be able to explain more of the currently unexplained variation.

  • in the absence of additional covariates, it is possible to add a special type of dummy covariate that is like a proxy for all additional covariates that we could add. This dummy variable is called a unit-level (or Observation-level) random effect. It essentially soaks up the additional variance.

  • another source of additional variation arises when the objects being counted have a tendency to aggregate (are clumped). When this is the case, the sampled data tend to be more variable, since any single sample is likely to contain either fewer or more individuals than the overall average. Hence the average of the samples might turn out to be similar to that of a more homogeneous population, yet the variance will be higher.

    The Negative Binomial distribution is able to accommodate clumpy data. Rather than assume that the variance and mean are equal (dispersion of 1), it estimates dispersion as an additional parameter. Together the two parameters of the Negative Binomial can be used to estimate the location (mean) and spread (variance) of the distribution.

  • yet another cause of over-dispersion is the presence of excessive zeros. In particular, some of these zeros could be false zeros. That is, a zero was recorded (no individuals observed), yet one or more individuals were actually present (just not detected). This would imply that the observed data are generated by two processes: one that determines the actual number of items that exist, and another that determines the detectability of the items. For these circumstances, we can fit zero-inflated models. Zero-inflated models are a mixture of two models, one representing the count process and the other representing the imperfect detection process.
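For context, the variance of the Negative Binomial distribution (as parameterised by glm.nb) increases quadratically with the mean, which is what allows it to accommodate this extra variation:

\[ Var(y_i) = \lambda_i + \frac{\lambda_i^2}{\theta} \]

As \(\theta \rightarrow \infty\), the quadratic term vanishes and the distribution converges back to a Poisson (variance = mean).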

Please go to the Negative Binomial (GLM) tab.

\[ y_i \sim{} \mathcal{NegBin}(\lambda_i, \theta)\\ ln(\lambda_i) = \beta_0 + \beta_1 ln(x_i) \]

peake.glm1 <- glm.nb(INDIV ~ log(AREA), data=peake)
## lets also fit a model in which we have centered the predictor to see the impact on estimated coefficients.
peake.glm2 <- glm.nb(INDIV ~ scale(log(AREA), scale=FALSE), data=peake)

Model validation

peake.glm1 |> autoplot(which=1:6)

Conclusions:

  • there is no longer clear evidence of a wedge (funnel) shape in the residuals suggesting that the assumption of homogeneity of variance is satisfied
  • the tails of the Q-Q plot are an improvement on the Poisson.
  • there are no longer any large Cook’s distances.
peake.glm1 |> influence.measures()
Influence measures of
     glm.nb(formula = INDIV ~ log(AREA), data = peake, init.theta = 7.369263003,      link = log) :

      dfb.1_  dfb.l.AR   dffit cov.r   cook.d    hat inf
1  -1.054875  0.987692 -1.1293 0.748 0.313001 0.1554   *
2   0.206875 -0.194302  0.2200 1.279 0.031769 0.1645   *
3   0.162776 -0.152952  0.1729 1.293 0.019220 0.1659   *
4   0.096225 -0.087632  0.1105 1.207 0.007770 0.1033    
5  -0.505155  0.446373 -0.6335 0.801 0.113572 0.0777    
6  -0.105801  0.090193 -0.1480 1.133 0.010967 0.0629    
7   0.016821 -0.014457  0.0230 1.169 0.000318 0.0655    
8   0.179367 -0.152721  0.2519 1.071 0.045529 0.0626    
9  -0.010516  0.007231 -0.0249 1.142 0.000356 0.0440    
10 -0.003083  0.001214 -0.0134 1.139 0.000105 0.0408    
11  0.013908 -0.000181  0.0976 1.116 0.006282 0.0406    
12 -0.009512 -0.000223 -0.0692 1.128 0.002572 0.0406    
13 -0.000763 -0.001702 -0.0175 1.139 0.000177 0.0411    
14 -0.010506 -0.000234 -0.0764 1.125 0.003099 0.0406    
15 -0.008622  0.042042  0.2384 1.018 0.041811 0.0420    
16 -0.006407  0.009578  0.0239 1.148 0.000345 0.0488    
17 -0.054463  0.085385  0.2303 1.044 0.038438 0.0474    
18 -0.009648  0.012719  0.0245 1.157 0.000363 0.0561    
19 -0.145174  0.184432  0.3238 1.009 0.078107 0.0608    
20  0.114479 -0.149993 -0.2847 1.029 0.033283 0.0568    
21 -0.282767  0.335956  0.4884 0.932 0.183075 0.0781    
22  0.304509 -0.346099 -0.4398 1.066 0.077348 0.1084    
23  0.089374 -0.100144 -0.1219 1.240 0.008049 0.1270    
24  0.221731 -0.247052 -0.2958 1.205 0.041771 0.1367    
25  0.579077 -0.646655 -0.7793 0.904 0.187567 0.1326    
peake.glm1 |> performance::check_model()

Conclusions:

  • there is no longer clear evidence of a wedge (funnel) shape in the residuals suggesting that the assumption of homogeneity of variance is satisfied.
  • the tails of the Q-Q plot are an improvement on the Poisson.
  • there are no longer any large (>0.8) Cook’s distances.
peake.resid <- peake.glm1 |> simulateResiduals(plot=TRUE)

peake.resid |> testResiduals()

$uniformity

    Exact one-sample Kolmogorov-Smirnov test

data:  simulationOutput$scaledResiduals
D = 0.11352, p-value = 0.8684
alternative hypothesis: two-sided


$dispersion

    DHARMa nonparametric dispersion test via sd of residuals fitted vs.
    simulated

data:  simulationOutput
dispersion = 1.1852, p-value = 0.496
alternative hypothesis: two.sided


$outliers

    DHARMa bootstrapped outlier test

data:  simulationOutput
outliers at both margin(s) = 0, observations = 25, p-value = 1
alternative hypothesis: two.sided
 percent confidence interval:
 0.00 0.04
sample estimates:
outlier frequency (expected: 0.0116 ) 
                                    0 

Conclusions:

  • there is no longer clear evidence of issues with the residual plot.
  • the Q-Q plot is a much better match to the ideal line
  • there is no longer any evidence of violations of distributional conformity (KS test), Dispersion and Outliers
##Check the model for lack of fit via:
##Pearson chisq
peake.ss <- sum(resid(peake.glm1, type = "pearson")^2)
1 - pchisq(peake.ss, peake.glm1$df.resid)
[1] 0.4897527
##No evidence of a lack of fit (p > 0.05)

#Deviance
1-pchisq(peake.glm1$deviance, peake.glm1$df.resid)
[1] 0.3036601
#No evidence of a lack of fit (p > 0.05)

Conclusions:

  • there is no evidence of a lack of fit (p>0.05)
peake.ss/peake.glm1$df.resid
[1] 0.9786343

Conclusions:

  • there is no longer any evidence of over-dispersion

The model diagnostics suggest that the Negative Binomial model is more appropriate - since it satisfied the assumptions whereas the Poisson did not. In this case, this should be the overriding factor in selecting between the two models. If, however, both were found to be valid, we could use information criteria to help us choose between the two.

Information criteria provide a relative measure of the fit of a model, penalising for complexity (i.e. a balance between goodness of fit and model simplicity). The actual value of an AIC by itself is of no real use; it is only useful as a comparative measure between two or more related models. For this purpose, a lower AIC is considered better.

One of the most widely used information criteria is the Akaike Information Criterion (AIC). The AIC is a measure of in-sample prediction error that penalises complexity:

\[ AIC = 2k - 2\ln(\hat{L}) \] where \(k\) is the number of parameters being estimated and \(\hat{L}\) is the maximised likelihood of the fitted model.
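As a quick sanity check of this formula, we can recompute the AIC of the Negative Binomial model by hand: its summary reports 2 × log-likelihood = -306.16, and it estimates three parameters (intercept, slope and \(\theta\)):

```r
## Hand-check of AIC = 2k - 2*ln(L) for the Negative Binomial model:
## glm.nb reports 2 x log-likelihood = -306.16 and estimates k = 3
## parameters (intercept, slope and theta).
k <- 3
logL <- -306.16 / 2
2 * k - 2 * logL    # 312.16, matching AIC(peake.glm1)
```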

As a general rule, if two AICs are within 2 units of each other, they are considered not to differ meaningfully. In that case, you would select the simpler model (the one with fewer estimated parameters).

AIC(peake.glm, peake.glm1)
           df       AIC
peake.glm   2 1738.9471
peake.glm1  3  312.1603
## For small sample sizes,  it is better to use AICc - this is
## corrected for small sample sizes.
AICc(peake.glm, peake.glm1)
           df      AICc
peake.glm   2 1739.4926
peake.glm1  3  313.3032

Conclusions:

  • purely on the basis of AIC, we would also have selected the Negative Binomial model (peake.glm1) over the Poisson model (peake.glm), as its AIC is lower by substantially more than 2 units.

Unfortunately, some of the following partial effects plotting routines seem to be broken at the moment…

## The following is equivalent to ggeffect
peake.glm1 |> plot_model(type = 'eff', show.data = TRUE,  terms = 'AREA') 

## plot_model(peake.glm1, type='eff', show.data=FALSE,  terms='AREA [log]') 
## plot_model(peake.glm1, type='eff', show.data=FALSE,  terms='AREA [exp]') 
## The following is equivalent to ggpredict
#plot_model(peake.glm1, type='pred',  show_data=TRUE, terms='AREA [exp]')
## The following is equivalent to ggemmeans
## plot_model(peake.glm1, type='emm',  terms='AREA [exp]')
peake.glm1 |> allEffects(residuals = TRUE) |> plot(type = 'response')

peake.glm1 |> ggpredict() |> plot(add.data = TRUE, jitter = FALSE)
$AREA

The ggpredict() |> plot() combination produces a list of ggplot objects. If we want to make adjustments to the ggplot object, we need to specify a single term in ggpredict().

## If you want to alter 
peake.glm1 |> ggpredict(term = 'AREA') |> plot(add.data = TRUE, jitter = FALSE) +
    scale_y_log10() +
    scale_x_log10()
Scale for y is already present.
Adding another scale for y, which will replace the existing scale.

peake.glm1 |> ggemmeans(~AREA) |> plot(add.data = TRUE, jitter = FALSE)

peake.glm1 |> ggemmeans(~AREA) |> plot(add.data = TRUE, jitter = FALSE) +
    scale_y_log10() +
    scale_x_log10()
Scale for y is already present.
Adding another scale for y, which will replace the existing scale.

6 Model investigation / hypothesis testing

peake.glm1 |> summary()

Call:
glm.nb(formula = INDIV ~ log(AREA), data = peake, init.theta = 7.369263003, 
    link = log)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.16452    0.53387  -2.181   0.0292 *  
log(AREA)    0.82469    0.06295  13.102   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(7.3693) family taken to be 1)

    Null deviance: 161.076  on 24  degrees of freedom
Residual deviance:  25.941  on 23  degrees of freedom
AIC: 312.16

Number of Fisher Scoring iterations: 1

              Theta:  7.37 
          Std. Err.:  2.16 

 2 x log-likelihood:  -306.16 

Conclusions:

  • the estimated y-intercept (value of Number of Individuals when log AREA is 0) is -1.16. Note that this is still on the link (log) scale. Note also, that as a log AREA value of 0 is outside the range of the collected data, this intercept does not really have any useful interpretation (unless we are happy to extrapolate). If we had centred the Area predictor (either before modelling or as part of the model - glm.nb(INDIV~scale(log(AREA),scale=FALSE), data=peake)), then the y-intercept would have had a sensible interpretation. It would have been the expected Individuals at the average clump Area. Nevertheless, either way, the hypothesis test associated with the intercept is rarely of any value.
  • the estimated slope is 0.82. For every one unit increase in log Area, the number of individuals (on a log scale) is expected to increase by 0.82 units.
  • if we back-transform the slope (by exponentiation), we get a slope of 2.28. This is interpreted as: for every 1 unit increase in log Area, the number of individuals increases 2.28-fold. That is, there is a 128 percent increase per 1 unit increase in log Area.
  • the estimated slope divided by the estimated uncertainty in the slope (Std. Error) gives a z-value of 13.1. Compared against a standard normal distribution, this yields a p-value < 0.05. Hence we would reject the null hypothesis of no relationship between Area and number of Individuals.
  • the \(R^2\) value is 0.86. Hence, 86 percent of the variation in number of Individuals is explained by its relationship to Area.
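The back-transformation of the slope described above can be verified directly:

```r
## Back-transform the slope from the natural-log link scale:
exp(0.82469)              # ~2.28-fold increase per 1 unit of log(AREA)
100 * (exp(0.82469) - 1)  # ~128 percent increase
```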
peake.glm1 |> confint()
Waiting for profiling to be done...
                2.5 %      97.5 %
(Intercept) -2.257255 -0.04087877
log(AREA)    0.693092  0.95477695
## or on the response scale
peake.glm1 |>
  confint() |>
  exp()
Waiting for profiling to be done...
                2.5 %    97.5 %
(Intercept) 0.1046373 0.9599455
log(AREA)   1.9998896 2.5980910

Conclusions:

  • the confidence intervals provide more context about the estimated effect of Area on the number of individuals. In addition to describing the implications of the typical effect (as indicated by the point estimate of the slope), we could also discuss the implications if the effect were as low as a 2-fold increase or as high as a 2.6-fold increase.
peake.glm1 |> tidy(conf.int=TRUE)
# A tibble: 2 × 7
  term        estimate std.error statistic  p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)   -1.16     0.534      -2.18 2.92e- 2   -2.26    -0.0409
2 log(AREA)      0.825    0.0629     13.1  3.22e-39    0.693    0.955 
peake.glm1 |> tidy(conf.int=TRUE, exponentiate=TRUE)
# A tibble: 2 × 7
  term        estimate std.error statistic  p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)    0.312    0.534      -2.18 2.92e- 2    0.105     0.960
2 log(AREA)      2.28     0.0629     13.1  3.22e-39    2.00      2.60 
peake.glm1 |> glance()
# A tibble: 1 × 8
  null.deviance df.null logLik      AIC   BIC deviance df.residual  nobs
          <dbl>   <int> <logLik>  <dbl> <dbl>    <dbl>       <int> <int>
1          161.      24 -153.0802  312.  316.     25.9          23    25
# warning this is only appropriate for html output
peake.glm1 |> sjPlot::tab_model(show.se = TRUE, show.aic = TRUE)
Response: INDIV

Predictors      Incidence Rate Ratios   std. Error   CI            p
(Intercept)     0.31                    0.17         0.10 – 0.96   0.029
AREA [log]      2.28                    0.14         2.00 – 2.60   <0.001
Observations    25
R² Nagelkerke   0.997
AIC             312.160

7 Predictions

#R2
1-(peake.ss/peake.glm1$null)
[1] 0.8602608
## Or based on deviance (preferred)
1-(peake.glm1$deviance/peake.glm1$null)
[1] 0.8389516

\[ R^2 = 1 - \exp\left(-\frac{2}{n}\left(logL(x) - logL(0)\right)\right) \] where \(logL(x)\) and \(logL(0)\) are the log-likelihoods of the fitted and null models respectively. Because the maximum achievable \(R^2\) is often less than one, this is sometimes adjusted (Nagelkerke's method) by dividing by: \[ max(R^2) = 1 - \exp\left(\frac{2}{n} logL(0)\right) \]
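These quantities can be computed by hand by refitting an intercept-only (null) model (a sketch under that assumption; `peake.null` is an illustrative name, not part of the original workflow):

```r
## Likelihood-ratio R^2 by hand (sketch): compare the fitted model's
## log-likelihood with that of an intercept-only null model.
peake.null <- MASS::glm.nb(INDIV ~ 1, data = peake)
n <- nrow(peake)
r2 <- 1 - exp(-(2 / n) * (as.numeric(logLik(peake.glm1)) -
                          as.numeric(logLik(peake.null))))
r2.max <- 1 - exp((2 / n) * as.numeric(logLik(peake.null)))
r2           # likelihood-ratio R^2 (cf. r.squaredLR)
r2 / r2.max  # Nagelkerke-adjusted
```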

peake.glm1 |> r.squaredLR()
[1] 0.855129
attr(,"adj.r.squared")
[1] 0.8551296
peake.glm1 |> performance::r2_nagelkerke()
Nagelkerke's R2 
      0.9970946 

8 Summary figures

## Using emmeans
peake.grid <- with(peake, list(AREA=seq(min(AREA), max(AREA), len=100)))
#OR
peake.grid <- peake |>
    data_grid(AREA=seq_range(AREA,  n=100))
newdata <- peake.glm1 |>
  emmeans(~AREA, at = peake.grid, type = "response") |>
  as.data.frame()
head(newdata)
      AREA  response        SE  df asymp.LCL asymp.UCL
  462.2500  49.19889  7.916538 Inf  35.89132  67.44057
  731.7629  71.85811  9.773036 Inf  55.04380  93.80870
 1001.2759  93.06469 11.171565 Inf  73.55395 117.75081
 1270.7888 113.28088 12.317959 Inf  91.53738 140.18926
 1540.3017 132.75321 13.318842 Inf 109.05506 161.60108
 1809.8146 151.63405 14.238678 Inf 126.14428 182.27450

Confidence level used: 0.95 
Intervals are back-transformed from the log scale 
ggplot(newdata, aes(y=response, x=AREA)) +
    geom_ribbon(aes(ymin=asymp.LCL, ymax=asymp.UCL),fill='blue', alpha=0.3) +
    geom_line() +
    theme_classic()

ggplot(newdata, aes(y=response, x=AREA)) +
    geom_ribbon(aes(ymin=asymp.LCL, ymax=asymp.UCL),fill='blue', alpha=0.3) +
    geom_line() +
    geom_point(data=peake, aes(y=INDIV)) +
    scale_x_log10(breaks = as.vector(c(1,2,5,10) %o% 10^(-1:4))) +
    scale_y_log10() +
    theme_classic()

## If we want to plot the partial observations
partial.obs <- peake.glm1 |> emmeans(~AREA, at=peake, type='response') |>
    as.data.frame() |>
    mutate(Partial.obs=response+resid(peake.glm1,type='response'))

## response residuals are just resid * mu.eta(predict)
ggplot(newdata, aes(y=response, x=AREA)) +
    geom_ribbon(aes(ymin=asymp.LCL, ymax=asymp.UCL),fill='blue', alpha=0.3) +
    geom_line() +
    geom_point(data=peake, aes(y=INDIV)) +
    geom_point(data=partial.obs, aes(y=Partial.obs), color='green') + 
    scale_x_log10(breaks = as.vector(c(1,2,5,10) %o% 10^(-1:4))) +
    scale_y_log10() +
    theme_classic() 

9 References

Peake, A. J., and G. P. Quinn. 1993. “Temporal Variation in Species-Area Curves for Invertebrates in Clumps of an Intertidal Mussel.” Ecography 16: 269–77.